Locating privileged spreaders on an Online Social Network 
Social media have provided plentiful evidence of their capacity for information diffusion. Fads
and rumors, but also social unrest and riots travel fast and affect large fractions of the popula- 
tion participating in online social networks (OSNs). This has spurred much research regarding the 
mechanisms that underlie social contagion, and also who (if any) can unleash system-wide infor- 
mation dissemination. Access to real data, both regarding topology -the network of friendships- 
and dynamics -the actual way in which OSN users interact-, is crucial to decipher how the former
facilitates the latter's success, understood as efficiency in information spreading. With the quanti- 
tative analysis that stems from complex network theory, we discuss who (and why) has privileged 
spreading capabilities when it comes to information diffusion. This is done considering the evolution 
of an episode of political protest which took place in Spain, spanning one month in 2011. 

PACS numbers: 89.20.Hh, 89.65.-s, 89.75.Fb, 89.75.Hc
I. INTRODUCTION

The question of how a piece of information (a virus, a rumor, an opinion, etc.)
spreads globally over a network, and which ingredients are necessary for such
success, has motivated much research recently. The reason behind this interest
is that identifying key aspects of spreading phenomena facilitates the
prevention (e.g., minimizing the impact of a disease) or the optimization
(e.g., the enhancement of viral marketing) of diffusion processes that can
reach system-wide scales. In the context of political protest or social
movements, information diffusion plays a key role in coordinating action and in
keeping adherents informed and motivated [1]. Understanding the dynamics of
such diffusion is important to locate who has the capability to transform the
emission of a single message into a global information cascade that affects the
whole system. These are the so-called "privileged" or "influential" spreaders.
Beyond purely sociological aspects, some valuable lessons might be extracted
from the study of this problem. For instance, current viral marketing
techniques (which capitalize on online social networks) could be improved by
encouraging customers to share product information with their acquaintances.
Since people tend to pay more attention to friends than to advertisers,
targeting privileged spreaders at the right time may enhance the efficiency of
a given campaign.

The prominence (importance, popularity, authority) of a node has, however, many
facets. From a static point of view, an authority may be characterized by the
number of connections it holds, or the place it occupies in a network. This is
the idea put forward in [?], where the authors seek the design of efficient
algorithms to detect particular (sub)graph structures: hierarchies and
tree-like structures. Turning to dynamics, a node may become popular because of
the attention it receives in short intervals of time [3], but that is a rather
volatile way of being important, because it depends on activity patterns that
change on the scale of hours or even minutes. A more lasting concept of
influence comprises both a topological (enduring) ingredient and the dynamics
it supports; this is the case of Centola's "reinforcing signals" [?] or the
k-core [?], which we follow here.

In this paper, we approach the problem of influential spreaders taking into
consideration data from the Spanish "15M movement" [?]. This peaceful civil
movement is an example of the social mobilizations (from the "Arab spring" to
the "Occupy Wall Street" movement) that have characterized 2011. Although it is
not firmly established whether OSNs have been fundamental instruments for the
successful organization and evolution of political movements, it is
increasingly evident [?] that at least they have been nurtured mainly in OSNs
(Facebook, Twitter, etc.) before reaching classic mass media. Data from these
grassroots movements (but also from less conflictive phenomena in the Web 2.0)
provide a unique opportunity to observe system-wide information cascades. In
particular, paying attention to the network structure allows for the
characterization of which users have outstanding roles in the success of
cascades of information. Our results complement some previous findings
regarding dynamical influence both at the theoretical [?] and the empirical [1]
levels. Besides, our analysis of activity cascades reveals distinctive traits
in different phases of the protests, which provides important hints for future
modeling efforts.



II. DATA: A NETWORKED VIEW OF THE 15M MOVEMENT

The "15M movement" is a still ongoing civic initiative with no party or union
affiliation that emerged as a reaction to perceived political alienation and to
demand better channels for democratic representation. The first mass
demonstration, held on Sunday May 15 (D from now on), was conceived as a
protest against the management of the economy in the aftermath of the financial
crisis. After the demonstrations on day D, hundreds of participants decided to
continue the protests by camping in the main squares of several cities (Puerta
del Sol in Madrid, Plaça de Catalunya in Barcelona) until May 22, the following
Sunday and the date of regional and local elections.

From a dynamical point of view, the data used in this study are a set of
messages (tweets) that were publicly exchanged through www.twitter.com. The
time-stamped data collected cover a period of one month (between April 25th,
2011 at 00:03:26 and May 26th, 2011 at 23:59:55) and were archived by Cierzo
Development Ltd., a start-up company. To filter the whole sample and keep only
those messages related to the protests, 70 keywords (hashtags) were selected,
namely those which were systematically used by the adherents to the
demonstrations and camps. The final sample consists of 535,192 tweets. In turn,
these tweets were generated by 85,851 unique users (out of a total of 87,569
users, of which 1,718 do not show outgoing activity, i.e., they are only
receivers). See [?] for more details.

Twitter is most frequently used as a broadcasting platform. Users subscribe to
what other users say, building a "who-listens-to-whom" network, i.e., the one
made up of followers and followings in Twitter. This means that any message
emitted by a node is immediately available to anyone following it, which is of
utmost importance to understand the concept of activity cascade in the next
sections. Such relationships offer an almost static view of the relationships
between users, the "follower network" for short. To build it, data for all the
involved users were scraped directly from www.twitter.com. The scrape was
successful for the 87,569 identified users, for whom we also obtained their
official list of followers restricted to those who had some participation in
the protests. The resulting structure is a directed network, in which direction
indicates who follows whom in the online social platform. In practice, we take
this underlying structure as completely static (it does not change through
time) because its time scale is much slower, i.e., changes probably occur on
the scale of weeks and months. In-degree $k_{in}$ expresses the number of users
a node is following, whereas out-degree $k_{out}$ represents the number of
users who follow a node. This network exhibits a high level of reciprocity: a
typical user holds many reciprocal relationships (with other users whom the
node probably knows personally), plus a few unreciprocated links which
typically point at hubs.
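
The degree convention just described can be illustrated with a minimal sketch. It assumes a hypothetical list of (follower, followed) pairs; links are stored along the direction in which messages travel, from the followed user to the follower, so that out-degree counts followers and in-degree counts followings. The function names `build_follower_network` and `reciprocity` are illustrative and are not taken from the original study.

```python
from collections import defaultdict

def build_follower_network(follow_pairs):
    """Return (out_nbrs, in_nbrs) from (follower, followed) pairs.

    Links point from the followed user to the follower, i.e. along
    the direction in which messages travel.
    """
    out_nbrs = defaultdict(set)  # user -> users who follow them
    in_nbrs = defaultdict(set)   # user -> users they follow
    for follower, followed in follow_pairs:
        if follower == followed:
            continue  # ignore self-follows
        out_nbrs[followed].add(follower)
        in_nbrs[follower].add(followed)
    return out_nbrs, in_nbrs

def reciprocity(out_nbrs):
    """Fraction of directed links whose reverse link also exists."""
    links = [(u, v) for u, nbrs in out_nbrs.items() for v in nbrs]
    if not links:
        return 0.0
    mutual = sum(1 for u, v in links if u in out_nbrs.get(v, ()))
    return mutual / len(links)

# Toy data (hypothetical): a and b follow each other, c follows a.
pairs = [("a", "b"), ("b", "a"), ("c", "a")]
out_nbrs, in_nbrs = build_follower_network(pairs)
k_out = {u: len(out_nbrs.get(u, ())) for u in "abc"}  # followers
k_in = {u: len(in_nbrs.get(u, ())) for u in "abc"}    # followings
```

In the toy data, user a is the only node with an unreciprocated incoming follow (from c), which mimics the "few unreciprocated links pointing at hubs" pattern on a very small scale.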

The main topological features of the follower network fit well in the concept
of "small-world" [?], i.e., low average shortest path length and high
clustering coefficient. Furthermore, both in- and out-degree are distributed as
a power law, indicating that connectivity is extremely heterogeneous. Thus, the
network supporting users' interactions is scale-free, with some rare nodes that
act as hubs [?].
FIG. 1: (Color online) Temporal evolution of the activity in
the online social network. In green, the proportion of nodes 
that had shown some activity at a certain time t. In yellow, 
the cumulative proportion of emitted messages as a function 
of time. Note that the two lines evolve in almost the same way. 
According to this evolution, we have distinguished two sub- 
periods: one of them characterized as "slow growth" due to 
the low activity level and the other one tagged as "explosive" 
or "bursty" due to the intense information traffic within it. 



III. METHODS 
A. Activity cascades 

An activity cascade (or simply "cascade", for short), starting at a seed,
occurs whenever a piece of information (or replies to it) is repeatedly
forwarded, more or less unchanged, towards other users. If one of those who
"hear" the piece of information decides to reply to it, he becomes a spreader;
otherwise he remains a mere listener. The cascade becomes global if the final
number of affected users $N_c$ (including the set of spreaders and listeners,
plus the seed) is comparable to the size of the whole system $N$. Intuitively,
the success of an activity cascade greatly depends on whether spreaders have a
large set of followers or not (Figure 2); remarkably, the seed is not
necessarily very well connected. This fact highlights the entanglement between
dynamics and the underlying (static) structure.

Note that the previous definition is too general to attain an operative notion
of cascade. One possibility is to leave time aside and consider only identical
pieces of information traveling across the topology (a retweet, in the Twitter
jargon). This may lead to inconsistencies, such as a node forwarding a piece of
information long after receiving it (perhaps days or weeks later). It is then
impossible to know whether the action is motivated by the original sender or by
some exogenous reason, i.e., one invisible to us. One may, alternatively, take
time into consideration, so that, regardless of the exact content of a message,
two nodes belong to the same cascade as consecutive spreaders if they are
connected (the latter follows the former) and they show activity within a
certain (short) time interval, $\Delta t$. The probability that exogenous
factors are driving activation is in this way minimized. Also, this concept of
cascade is more inclusive, as it takes in dialogue-like messages (which, we
emphasize, are typically produced in short time spans). This scheme exploits
the concept of spike train from neuroscience, i.e., a series of discrete action
potentials from a neuron taken as a time series. At a larger scale, two brain
regions are identified as functionally related if their activation happens in
the same time window. Consequently, message chains are reconstructed assuming
that activity is contagious if it takes place in short time windows.

FIG. 2: (Color online) The figure illustrates the concept of cascade that is used throughout this article. User 1 emits a message
at time $t$, and all of his followers automatically receive it. Thus, they are already counted as part of the cascade (small red
circles). One of his followers (user 2, big blue node), driven by the previous message, decides to participate at time
$t + \Delta t$ by posting a message himself. A second set of followers is included in the cascade. Finally, a third node (user 3,
big green circle) joins in and spreads the cascade further at time $t + 2\Delta t$. A node cannot be counted twice; note, for
example, that user 4 is also following node 3. Many nodes remain unaffected, because they are not connected to any of the
spreaders. The final size of the cascade is $N_c$ = [?]; the success of the cascade largely depends on the capacity to contact a
"leader" or "privileged spreader", i.e., a hub to whom many people listen and who decides to participate. The interesting point,
however, is that the number of spreaders needed to attain such success is very low (3), and over 50% of the cascade is triggered
by just one of them.

We apply the latter definition to explore the occurrence of information
cascades in the data. In practice, we take a seed message posted by $i$ at time
$t_0$ and mark all of $i$'s followers as listeners. We then check whether any
of these listeners showed some activity at time $t_0 + \Delta t$. This is done
recursively until no other follower shows activity, see Figure 2. In our
scheme, a node can only belong to one cascade; this constraint introduces a
bias in the measurements, namely, two nodes sharing a follower may show
activity at the same time, so their follower may be counted in one or another
cascade (with possibly important consequences regarding the average size of
cascades and their penetration in time). To minimize this degeneracy, we
perform calculations for many possible cascade configurations, randomizing the
way we process data. We distinguish information cascades (or just cascades, for
short) from spreader-cascades. In information cascades we count any affected
user (listeners and spreaders), whereas in spreader-cascades only spreaders are
taken into account.
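
The reconstruction procedure described above can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions, not the authors' code: it interprets "activity at time $t_0 + \Delta t$" as any post inside the window $(t, t + \Delta t]$ after the triggering message, and the names `build_cascade`, `followers`, `activity` and `assigned` are hypothetical. As a simplification, a follower first reached as a listener stays a listener for the rest of the cascade.

```python
from collections import deque

def build_cascade(seed, t0, followers, activity, dt, assigned):
    """Grow one cascade from `seed`, who posted at time `t0`.

    followers: dict user -> set of users who follow them
    activity:  dict user -> sorted list of posting times
    dt:        length of the contagion time window
    assigned:  set of users already claimed by earlier cascades;
               updated in place (a node belongs to one cascade only).
    Returns (spreaders, listeners) as two sets.
    """
    spreaders, listeners = {seed}, set()
    assigned.add(seed)
    queue = deque([(seed, t0)])
    while queue:
        user, t = queue.popleft()
        for f in followers.get(user, ()):
            if f in assigned:
                continue  # a node is never counted twice
            assigned.add(f)
            # First post of f inside the window (t, t + dt], if any.
            reply = next(
                (s for s in activity.get(f, ()) if t < s <= t + dt), None
            )
            if reply is None:
                listeners.add(f)
            else:
                spreaders.add(f)
                queue.append((f, reply))
    return spreaders, listeners

# Toy data (hypothetical), loosely following the situation of Fig. 2.
followers = {1: {2, 4, 5}, 2: {3, 6}, 3: {4, 7}}
activity = {1: [0], 2: [1], 3: [2], 7: [10]}
spreaders, listeners = build_cascade(1, 0, followers, activity, 1, set())
cascade_size = len(spreaders) + len(listeners)          # information cascade
spreader_cascade_size = len(spreaders)                  # spreader-cascade
```

Sharing one `assigned` set across all seeds enforces the one-cascade-per-node constraint; shuffling the order in which seed messages are processed (for instance with `random.shuffle`) and repeating the measurement is one way to average over the possible cascade configurations mentioned in the text.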

We measure the size distributions of cascades and spreader-cascades for three
different scenarios: one in which the information intensity is low (slow growth
phase, from $D-20$ to $D-10$), one in which activity is bursty (explosive
phase, $D-2$ to $D+6$) and one that considers all available data (which spans a
whole month and includes the two previous scenarios plus the time in between,
$D-20$ to $D+10$). Figure 1 illustrates these different periods. The green line
represents the cumulative proportion of nodes in the network that had shown
some activity, i.e., had sent at least one message, measured by the hour.
FIG. 3: (Color online) Upper panels (a,b,c): Cascade size probability distributions for the different periods considered. Lower
panels (d,e,f): Probability distributions of spreaders involved in the cascades for the same periods. The exact periods considered 
in the analyses are indicated at the top of each panel. See the text for further details. 



We tag the first 10 days of study as "slow growth" because, over that period,
the number of active people grew by less than 5% of the total number of users,
indicating that recruitment for the protests was slow at that time. The
opposite argument applies to the bursty or "explosive" phase: in only 8 days
the proportion of active users grew from less than 10% to over 80%. The same
can be said about global activity (in terms of the total number of emitted
directed messages, the activity network), which shows an almost identical
growth pattern. Besides, within the different time periods (slow growth,
explosive and total), different time windows have been set to assess the
robustness of our results. Our proposed scheme relies on the contagious effect
of activity, thus large time windows, i.e., $\Delta t > 24$ hours, are not
considered.
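
The hourly curve of active users used to delimit the periods can be computed along the following lines. This is a sketch under assumptions: timestamps are taken to be seconds, the input is a hypothetical list of (user, timestamp) pairs, and the function name `cumulative_active_fraction` is illustrative.

```python
def cumulative_active_fraction(messages, n_users, t_start, t_end, step=3600):
    """Cumulative fraction of users with at least one message, per time bin.

    messages: iterable of (user, timestamp_in_seconds)
    Returns one value per `step`-second bin in [t_start, t_end).
    """
    first_post = {}
    for user, t in messages:
        if t_start <= t < t_end and t < first_post.get(user, float("inf")):
            first_post[user] = t  # keep the earliest post of each user
    n_bins = int(-(-(t_end - t_start) // step))  # ceiling division
    new_per_bin = [0] * n_bins
    for t in first_post.values():
        new_per_bin[int((t - t_start) // step)] += 1
    fractions, active = [], 0
    for n_new in new_per_bin:
        active += n_new
        fractions.append(active / n_users)
    return fractions

# Toy data (hypothetical): four users, three of them active in 3 hours.
msgs = [("a", 10), ("b", 3700), ("a", 4000), ("c", 7300)]
curve = cumulative_active_fraction(msgs, n_users=4, t_start=0, t_end=10800)
```

Comparing the value of such a curve at the start and at the end of a candidate period gives the growth figures quoted above (less than 5% in the slow phase, from under 10% to over 80% in the explosive one).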



B. k-shell decomposition

The k-core decomposition of a network consists of identifying particular
subsets of the network, called k-cores, each obtained by recursively removing
all the vertices of degree less than $k$, where $k = k_{in} + k_{out}$
indicates the total number of in- and out-going links of a node, until all
vertices in the remaining graph have degree at least $k$. In the end, each node
is assigned a natural number (its coreness); the higher the coreness, the
closer a node is to the nucleus or core of the network. The main advantage of
this centrality measure over other quantities is its low computational cost,
which scales as $O(N + E)$, where $N$ is the number of vertices of the graph
and $E$ is the number of links it contains [10]. This decomposition has been
successfully applied in the analysis of the Internet and the Autonomous Systems
structure [11], [12]. In the following section, we will use the k-core
decomposition as a means to identify influence in social media. In particular,
we discuss which of the two, degree or coreness, is a better predictor of the
extent of an information cascade.
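
As an illustration of the peeling procedure, the sketch below assigns a coreness to every node using the total degree $k = k_{in} + k_{out}$. It is a simple heap-based variant that runs in $O(E \log N)$, not the linear-time algorithm referred to in the text, and the function name `coreness` is hypothetical. Node labels are assumed to be mutually comparable (all strings or all integers), because they are used to break ties in the heap.

```python
import heapq
from collections import defaultdict

def coreness(directed_edges):
    """k-shell index of every node, using total degree k = k_in + k_out."""
    nbrs = defaultdict(list)  # undirected view; reciprocal links count twice
    for u, v in directed_edges:
        if u != v:
            nbrs[u].append(v)
            nbrs[v].append(u)
    degree = {u: len(vs) for u, vs in nbrs.items()}
    heap = [(d, u) for u, d in degree.items()]
    heapq.heapify(heap)
    core, k = {}, 0
    while heap:
        d, u = heapq.heappop(heap)
        if u in core or d != degree[u]:
            continue  # stale heap entry, superseded by a lower degree
        k = max(k, d)  # the shell index never decreases while peeling
        core[u] = k
        for v in nbrs[u]:
            if v not in core:
                degree[v] -= 1
                heapq.heappush(heap, (degree[v], v))
    return core

# Toy data (hypothetical): a directed triangle plus one pendant node.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
shells = coreness(edges)
```

Here the pendant node d is peeled first and ends in shell 1, while the three triangle nodes survive until $k = 2$. Node a has the largest degree (3) yet shares its coreness with b and c, which is the kind of distinction between degree and coreness exploited in the following section.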



IV. RESULTS 

The upper panels (a, b, c) of Figure 3 show that a cascade of size $O(N)$ can
be reached at any activity level (slow growth, explosive or both). As expected,
these large cascades occur rarely, as the power-law probability distributions
evidence. This result is robust to different temporal windows up to 24 h. In
contrast, the lower panels (d, e, f) show significant differences between
periods. Specifically, the distribution of involved spreaders changes radically
from the "slow growth" phase (Figure 3d) to the "explosive" period (Figure 3f);
the distribution that considers the whole period of study just reflects that
the bursty period (in which most of the activity takes place) dominates the
statistics. The importance of this difference is that one may conclude that, to
attain similar results, a proportionally much smaller number of spreaders is
needed in the slow growth period. Looking at the detail, however, it seems
clear (and coherent with the temporal evolution of the protests, Fig. 1) that
although cascades in the slow period (panel a) affect as many as $N/2$ users,
the system is in a different dynamical regime than in the explosive one:
indeed, the distributions suggest that there has been a shift from a
subcritical to a supercritical phase.

The previous conclusions raise further questions: is there a way to identify
"privileged spreaders"? Are they placed randomly throughout the network's
topology? Or do they occupy key spots in the structure? And will these
influential users be more easily detected in a bursty period (where large
cascades occur more often)? In what context will influential spreaders stand
out? To answer these questions, we capitalize on previous work suggesting that
centrality (measured as the k-core) enhances the capacity of a node to be key
in disease spreading processes [?]. The authors in [?] discussed whether the
degree of a node (its total number of neighbors, $k$) or its k-core (a
centrality measure) can better predict the spreading capabilities of such a
node. Note that the k-shell decomposition splits a network into a few levels
(over a hundred), while node degrees can range from one or two up to several
thousand.

FIG. 4: (Color online) Left upper panel: average spreading capacity (with respect to the system size) of nodes grouped according
to their k-core. Spreading capacity grows with coreness, but the explosive period (red squares) shows a much less clear tendency,
with many fluctuations and a lower overall spreading capacity compared to the slow growth period (black circles). Left lower
panel: the same information is shown as a function of the degree. Again, the slow growth period is the best one at predicting the
extent of a cascade. Interestingly, average cascades for the highest degrees outperform those triggered by the highest k-core
nodes by an order of magnitude. See main text for discussion on this aspect. Right panels show the k-core and degree
distributions, i.e., how many nodes belong to each class. Note that the highest core contains over 1000 users.

We have explored the same idea, but in relation to activity cascades, which are
the object of interest here. The upper left panel of Fig. 4 shows the spreading
capabilities as a function of classes of k-cores. Specifically, we take the
seed of each particular cascade and save its coreness and the final size of the
cascade it triggers. Having done so for each cascade, we can average the
success of cascades for a given core number. Remarkably, for every scenario
under consideration (slow, explosive, whole), a higher core number yields
larger cascades. This result supports the ideas developed in [?], but it is at
odds with those reported in [13], which shows that the k-core of a node is not
relevant in rumor dynamics. Exactly the same conclusion (and even more
pronounced) can be drawn when considering degree (lower left panel), which
appears to be in contradiction with the mentioned previous evidence [?].
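
The averaging step described above (group cascades by a property of their seed, then average the normalized size) can be sketched as follows. The input format and the name `mean_size_by_class` are assumptions made for illustration; the same function serves for coreness and for degree, depending on which descriptor is passed.

```python
from collections import defaultdict

def mean_size_by_class(cascades, descriptor, n_users):
    """Average normalized cascade size per seed class.

    cascades:   iterable of (seed, cascade_size) pairs
    descriptor: dict seed -> class value (coreness or degree)
    n_users:    system size N used for normalization
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for seed, size in cascades:
        cls = descriptor[seed]
        total[cls] += size / n_users
        count[cls] += 1
    return {cls: total[cls] / count[cls] for cls in sorted(total)}

# Toy data (hypothetical): three cascades, seeds in cores 1, 1 and 3.
cascades = [("a", 10), ("b", 20), ("c", 50)]
core_of = {"a": 1, "b": 1, "c": 3}
curve = mean_size_by_class(cascades, core_of, n_users=100)
```

Running the function separately on the cascades of each period (slow growth, explosive, whole) would give one curve per scenario, in the spirit of the left panels of Fig. 4.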

At first sight, our findings seem to indicate that if privileged spreaders are
to be found, one should simply identify the individuals who are highly
connected. However, this procedure might not be the best choice. The right
panels in Figure 4 show the k-core (upper) and degree (lower) distributions,
indicating the number of nodes which are seeds at one time or another,
classified in terms of their coreness or degree. Unsurprisingly, many nodes
belong to low cores and have low degrees. The interest of these histograms
lies, however, in the tails of the distributions, where one can see that, while
there are a few hundred nodes in the high cores (and even over a thousand in
the last core), the highest degrees account for only a few dozen nodes. In
practice, this means that by looking at the degree of the nodes, we will be
able to identify only a few influential spreaders (the ones that produce the
largest cascades). However, the number of such influential individuals is far
larger than a few. As a matter of fact, high cascading capabilities are
distributed over a wider range of cores, which in turn contain a significant
number of nodes. Focusing on Fig. 4, note that triggering cascades affecting
over $10^{-2}$ of the network's population demands nodes with $k > 10^3$.
Checking the distribution of degrees (right-hand side), it is easy to see that
a negligible number of nodes display such a degree range. Along the same line,
we may wonder what it takes to trigger cascades affecting over $10^{-2}$ of the
network's population from the k-core point of view. In this case, nodes with
k-core around 125 show such capability. A quick look at the core distribution
shows that over 1500 nodes meet these conditions, i.e., they belong to the
125th k-shell or higher.
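
The comparison made in this paragraph amounts to two steps: find the smallest class value whose average cascade reaches a target size, then count how many nodes sit at or above that value. A minimal sketch, assuming a curve of mean normalized cascade size per class and a hypothetical function name `candidates_above`, could read as follows. It takes the lowest class that reaches the target, which is only meaningful when the curve grows roughly monotonically; for noisy curves such as the explosive period the threshold should be read with care.

```python
def candidates_above(mean_size, descriptor, target):
    """Count nodes whose class reaches `target` mean cascade size.

    mean_size:  dict class value -> mean normalized cascade size
    descriptor: dict node -> class value (degree or coreness)
    Returns (threshold, number of nodes with class >= threshold),
    or (None, 0) if no class reaches the target.
    """
    reaching = [c for c, s in mean_size.items() if s >= target]
    if not reaching:
        return None, 0
    threshold = min(reaching)
    n_nodes = sum(1 for c in descriptor.values() if c >= threshold)
    return threshold, n_nodes

# Toy data (hypothetical): classes 2 and 3 reach 1% of the system.
mean_size = {1: 0.001, 2: 0.02, 3: 0.05}
descriptor = {"a": 1, "b": 2, "c": 3, "d": 3}
threshold, n_nodes = candidates_above(mean_size, descriptor, target=1e-2)
```

Applying the same call once with degrees and once with coreness values makes the argument of the text quantitative: the descriptor that yields the larger `n_nodes` for the same target offers more candidate seeds.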

We may now distinguish between scenarios in Figure 4. While all of the analyzed
periods show a growing tendency, i.e., cascades are larger the larger the
considered descriptor is, we highlight that it is in the slow growth period
(black circles) where the tendency is clearest, i.e., results are less noisy.
Between the other two periods, the explosive one (red squares) is distinctly
the less robust, in the sense that cascade sizes oscillate strongly across
k-cores, and the final plot shows a smaller slope than the other two. This
subtle fact is again of great importance: it means that during "information
storms" a large cascade can be triggered from anywhere in the network (and,
conversely, small cascades may have begun in important nodes). The reason for
this is that in periods where bursty activity dominates, the system suffers
"information overflow": the amount of noise flattens the differences between
nodes. For instance, in these periods a node from the periphery (low coreness)
may compensate for his unprivileged situation by emitting messages very
frequently. This behavior yields a situation in which, from a dynamical point
of view, nodes become increasingly indistinguishable. The plot corresponding to
the whole period analyzed (green triangles) lies consistently between the other
two scenarios, but closer to the relaxed period. This is perfectly coherent:
the study spans 30 days and the explosive period represents only 25% of it,
whereas the relaxed period accounts for over 33%. Furthermore, those days
between $D-10$ and $D-2$, and beyond $D+6$, resemble the relaxed period as far
as the flow of information is concerned.



V. CONCLUSIONS 

Online social networks are called to play an ever increasing role in shaping
many of our habits (be they commercial or cultural) as well as our position on
political, economic or social issues, not only at a local, country-wide level,
but also at the global scale. It is thus of utmost importance to uncover as
many aspects as possible of the topological and dynamical features of these
networks. One particular aspect is whether or not one can identify, in a
network of individuals with common interests, those who are influential to the
rest. Our results show that the degree of the nodes seems to be the best
topological descriptor to locate such influential individuals. However, there
is an important caveat: the number of such privileged seeds is very low, as
there are only a few of these highly connected subjects. In contrast, by
ranking the nodes according to their k-core index, which can be done at a low
computational cost, one can safely locate the (more abundant) individuals that
are likely to generate large (nearly) system-wide cascades. The results
presented here also lead to a surprising conclusion: periods characterized by
explosive activity are not convenient for the spreading of information
throughout the system using influential individuals as seeds. This is because
in such periods, the high level of activity (mainly coming from users who are
badly located in the network) introduces noise in the system. Consequently,
influential individuals lose their unique status as generators of system-wide
cascades and therefore their messages are diluted.

On more general grounds, our analysis of real data highlights the importance of
empirical results to validate theoretical contributions. In particular, Fig. 4
together with the observations in [?] raises some doubts about rumor dynamics
as a good proxy for real information diffusion. We hypothesize that such models
approach information diffusion phenomena in too simplistic a way, thus failing
to capture relevant mechanisms such as complex activity patterns [?]. Finally,
although the underlying topology may be regarded as constant, any modeling
effort should also contemplate the time evolution of the dynamics. Indeed,
Fig. 3 suggests that the system is in a subcritical phase when the activity
level is low, and critical or supercritical during the explosive period. This
is related to the rate at which users are increasingly being recruited as
active agents, i.e., the speed at which listeners become spreaders.